Practical: outbreak reconstruction
Introduction
On the tidyverse
The tidyverse is a collection of R packages designed for data science. Developed with high standards of coding practices and software development, these packages form a consistent ecosystem to handle and visualise data. In this practical, we will mostly rely on the following packages:
dplyr for handling data, making new variables, building summaries, number-crunching
tidyr to re-arrange tidy/data
ggplot2 to visualise data
magrittr for the piping operator (
%>%)
Most of these functionalities are summarised in handy cheatsheets. We provide links to the most relevant ones below:
basics of R (not tidyverse, but they remain very useful)
Other packages
Other packages we will use include:
rmarkdown for automated report generation
rio to read
.xlsxfiles inape for phylogenetic reconstruction
Cheatsheets and other resources:
the rmarkdown cheatsheet
the knitr website documenting many options used in rmarkdown’s settings and code chunks
On the RECONverse
Several RECON packages will be used in the practical:
linelist for data cleaning
incidence older package to build and visualise epidemic curves, and do basic model fitting; it is being replaced by incidence2 but some older packages like earlyR (for estimating the reproduction number) and projections (for short term forecasting) are still using it.
epicontacts to visualise and analyse contact tracing data or transmission chains
epitrix to estimate delay distributions
distcrete to handle delay distributions
EpiEstim for estimation of the reproduction number over time
outbreaker2 for outbreak reconstruction using multiple data sources
Note that earlyR and projections will soon be adapted to handle incidence2 objects.
For some of these packages, we recommend using the github (development) version, either because it is more up-to-date, or because they have not been released on CRAN yet.
What we will do today
This practical will follow up on the response to the fictitious Ebola outbreak in Morporkia. It will build and consolidate material seen in previous sessions, and will show how genetic data can be used to reconstruct transmission trees (who infects whom?), especially when combined with epidemiological data. We will particularly look at:
- cleaning / standardising ‘dirty’ data
- building epidemic curves
- estimate serial interval distributions of contact data
- build new distributions using parameters reported in the literature
- visualise and analyse transmission chains / contact tracing data
- estimate growth rates from epidemic curves
- estimate time varying reproduction number
- read in genetic data
- make a simple phylogenetic tree
- combine epidemiological and genetic data to reconstruct transmission trees
In addition, we strongly recommend storing all analyses into one or several rmarkdown reports.
How this document works
This practical will take you through the reconstruction of a fictitious Ebola outbreak using various sources of data. It is pitched at an intermediate level, so that a fair fraction of the code is to be generated by the student. Wherever this is the case, code seen in previous practicals should come in handy. In all instances, try to first answer your own questions: try/error is a great way to learn R. But if you feel stuck, do not hesitate to questions to the demonstrators: they are here to help.
It is important that you understand what the code does. Look at relevant documentation, try tweaking the code, experiment, and do not hesitate to ask questions. It is not essential to go through all examples to illustrate important points, so take your time, especially if you are new to R. Also note that the solutions provided are by not means the only options for solving a given problem - feel free to explore alternatives and discuss them with the trainers.
An update on the EVD outbreak in Ankh, Republic of Morporkia
This practical is the second part of the response to a simulated Ebola Virus Disease (EVD) outbreak taking place in the city of Ankh, Republic of Morporkia. While the first part focussed on early assessments of transmissibility, this part explores more methodological options for estimating transmissibility, and provides an introduction to outbreak reconstruction using outbreaker2.
After some rather concerning on the new EVD outbreak in the city of Ankh, Republic of Morporkia, Public Health Morporkia (PHM) has sent you updated linelists and contact data. This time, PHM has also obtained Whole Genome Sequences (WGS) of the Ebola virus isolated in patients. As before, you are asked to assess the situation and produce evidence-based recommendations for informing the response.
The new data
The data update includes new linelists and contact lists:
PHM-EVD-linelist-2017-12-01.xlsx: a linelist containing case information up to the 1st December 2017
PHM-EVD-contacts-2017-12-01.xlsx: a list of contacts reported between cases up to the 1st December 2017, where
fromindicates a potential source of infection, andtothe recipient of the contact
Exercise: Re-using code you used previously, import and clean the data. We recommend the following steps:
create a
data/folder in your project directorydownload and save the data files in
data/import the files using
here::hereto find the path to the data, andrio::importto read the files and import them into R; save the linelist as an object calledlinelist_raw, and contacts ascontacts_rawimport the cleaning rules from the corresponding excel file and save the object as
cleaning_rulesuse linelist’s function to clean the data, using
guess_dateson the relevant columns to make sure dates are useable, and the cleaning rules defined incleaning_rules; save the clean (tibble) data aslinelistandcontacts
Once imported and cleaned, the data should look like:
# linelist: one line per case
linelist
## # A tibble: 50 × 6
## case_id date_of_onset date_of_report sex age location
## <chr> <date> <date> <chr> <dbl> <chr>
## 1 39e9dc 2017-10-10 2017-10-22 female 62 pseudopolis
## 2 664549 2017-10-16 2017-10-24 male 28 pseudopolis
## 3 b4d8aa 2017-10-17 2017-10-23 male 54 ankh_morpork
## 4 51883d 2017-10-18 2017-10-22 male 57 pseudopolis
## 5 947e40 2017-10-20 2017-10-25 female 23 ankh_morpork
## 6 9aa197 2017-10-20 2017-10-23 female 66 pseudopolis
## 7 e4b0a2 2017-10-21 2017-10-24 female 13 ankh_morpork
## 8 af0ac0 2017-10-21 2017-10-26 male 10 ankh_morpork
## 9 185911 2017-10-21 2017-10-26 female 34 ankh_morpork
## 10 601d2e 2017-10-22 2017-10-28 male 11 ankh_morpork
## # … with 40 more rows
# contacts: pairs of cases with reported contacts
contacts
## # A tibble: 36 × 2
## source_case_id case_id
## <chr> <chr>
## 1 9aa197 426b6d
## 2 39e9dc a8e9d8
## 3 c2a389 95fc1d
## 4 51883d 778316
## 5 51883d e37897
## 6 933811 99abbe
## 7 51883d 185911
## 8 b4d8aa e4b0a2
## 9 605322 1875e2
## 10 b4d8aa b5ad13
## # … with 26 more rowsAnalysis of contacts and transmissibility
Analysis of contact data
After the initial stage of the outbreak, contact tracing has been maintained but started being sparser, and exposures haven’t been reported for all cases. Despite these limitations, contacts are still a valuable source of information. Using the function make_epicontacts in the epicontacts package, create a new epicontacts object called x, specifying that contacts are directed. When plotting the data, use the arguments node_shape and shapes (see ?vis_epicontacts) to distinguish males and females. Also, use node_color to distinguish individuals from Pseudopolis and Ankh Morpork. For a list of available symbols and corresponding shape code, you can type epicontacts::codeawesome.
The results should look like:
##
## /// Epidemiological Contacts //
##
## // class: epicontacts
## // 50 cases in linelist; 36 contacts; directed
##
## // linelist
##
## # A tibble: 50 × 6
## id date_of_onset date_of_report sex age location
## <chr> <date> <date> <chr> <dbl> <chr>
## 1 39e9dc 2017-10-10 2017-10-22 female 62 pseudopolis
## 2 664549 2017-10-16 2017-10-24 male 28 pseudopolis
## 3 b4d8aa 2017-10-17 2017-10-23 male 54 ankh_morpork
## 4 51883d 2017-10-18 2017-10-22 male 57 pseudopolis
## 5 947e40 2017-10-20 2017-10-25 female 23 ankh_morpork
## 6 9aa197 2017-10-20 2017-10-23 female 66 pseudopolis
## 7 e4b0a2 2017-10-21 2017-10-24 female 13 ankh_morpork
## 8 af0ac0 2017-10-21 2017-10-26 male 10 ankh_morpork
## 9 185911 2017-10-21 2017-10-26 female 34 ankh_morpork
## 10 601d2e 2017-10-22 2017-10-28 male 11 ankh_morpork
## # … with 40 more rows
##
## // contacts
##
## # A tibble: 36 × 2
## from to
## <chr> <chr>
## 1 9aa197 426b6d
## 2 39e9dc a8e9d8
## 3 c2a389 95fc1d
## 4 51883d 778316
## 5 51883d e37897
## 6 933811 99abbe
## 7 51883d 185911
## 8 b4d8aa e4b0a2
## 9 605322 1875e2
## 10 b4d8aa b5ad13
## # … with 26 more rows
What can you say about these contacts? How would you interpret the different clusters?
Question / exercise:
What can you say about these contacts? How would you interpret the different clusters?
A colleague, epidemiologist at PHM, regularly states during meetings that Pseudopolis is the main ‘source’ of new infections (‘seeds’) into Ankh Morpork. Test this hypothesis (see
?get_pairwise). What do you think?
Estimating growth rates from epidemic curves
Exercise:
Using approaches seen in previous practicals, build epidemic curves, and estimate the growth rate for the two cities separately, using the entire time series in each case. What results do you find? Can these results be trusted? Why?
Tips on estimating growth rates
There are two options for estimating the growth rates as requested:
the old way: use the incidence package to build the epidemic curves, and the function
fitto estimate the growth rates using a log-linear regression, with the caveat that days with 0 incidence are ignoredthe new way: use incidence2 to build the epidemic curves, and the function
fit_curvefrom the i2extras package, specifying a Poisson model; you can then merely plot the results usingplot, and derive corresponding growth rates and doubling/halving times usinggrowth_rate
Estimating time-varying reproduction numbers
When the assumption that \(R\) is constant over time becomes untenable, an alternative is the estimationg of time-varying transmissibility using the instantaneous reproduction number \(R_t\). This approach, introduced by Cori et al. (Cori et al. 2013), is implemented in the package EpiEstim (function estimate_R). It estimates \(R_t\) for a succession of sliding time windows, using the same Poisson branching process model as in earlyR. In the following, we:
estimate the serial interval by fitting a discretized Gamma distribution to the empirical data derived from the transmission chains
build an
incidenceobject to serve as input toestimate_R; note that newincidence2objects will soon be possible inputs forestimate_Ras welluse
estimate_Rto estimate the time-varying reproduction number
library(incidence)
library(EpiEstim)
# estimate the serial interval
si <- x %>%
get_pairwise("date_of_onset") %>% # extract empirical data from chains
epitrix::fit_disc_gamma() # fit discretised Gamma
si
## $mu
## [1] 13.09018
##
## $cv
## [1] 0.6489035
##
## $sd
## [1] 8.494266
##
## $ll
## [1] -122.525
##
## $converged
## [1] TRUE
##
## $distribution
## A discrete distribution
## name: gamma
## parameters:
## shape: 2.37486979332248
## scale: 5.51195897078819
Rt <- incidence::incidence(linelist$date_of_onset) %>%
estimate_R(method = "parametric_si",
config = make_config(mean_si = si$mu,
std_si = si$sd))
plot(Rt)Question:
What do you make of these results? What do you think is the probable course of the outbreak if nothing changes?
Finding who infected whom
To gain a better understanding of the transmission process, we can attempt to reconstruct plausible transmission trees using the dates of symptom onsets and limited contact data. This can be achieved using outbreaker2 (Campbell, Didelot, et al. 2018), which provides a modular platform for outbreak reconstruction. This package extends and replaces outbreaker (Jombart et al. 2014), which in contrast was a static implementation of a specific transmission model (Jombart et al. 2014).
Looking at Whole Genome Sequences (WGS)
WGS have been obtained for all cases in this outbreak. They are stored as a the fasta file phm_evd_wgs.fa). Download this file, save it in the data/ folder, and then run the code below to import the data in R:
library(ape)
dna_file <- here::here("data", "phm_evd_wgs.fa")
dna <- read.FASTA(dna_file)
dna
## 50 DNA sequences in binary format stored in a list.
##
## All sequences of same length: 18958
##
## Labels:
## 39e9dc
## 664549
## b4d8aa
## 51883d
## 947e40
## 9aa197
## ...
##
## Base composition:
## a c g t
## 0.250 0.248 0.248 0.254
## (Total: 947.9 kb)
identical(labels(dna), linelist$case_id) # check sequences match linelist data
## [1] TRUEAs a first exploration of the data, we derive a Neighbour-Joining tree rooted at the first case of the outbreak:
nj <- nj(dist.dna(dna, model = "N")) # NJ on nucleotide distances (model = "N")
nj
##
## Phylogenetic tree with 50 tips and 48 internal nodes.
##
## Tip labels:
## 39e9dc, 664549, b4d8aa, 51883d, 947e40, 9aa197, ...
##
## Unrooted; includes branch lengths.
nj <- root(nj, 1)
plot(nj, main = "Neighbour Joining tree")
axisPhylo()This phylogenetic tree shows the inferred evolution of the pathogen sequences. Branch length (x-axis) correspond to the number of mutations occuring between lineages (indicated by the axis at the bottom). The tree has been rooted to the index case, so that this sequence (top, left) is the “most ancient” part of the tree. Note that in such representations, distances on the y-axis are meaningless.
Question:
How would you interpret this phylogenetic tree? Many methods of outbreak reconstruction infer transmission events from phylogenies. What results would you expect here?
Building delay distributions
outbreaker2 can handle different types of dates. When dates of onset are provided, information on the generation time (delay between primary and secondary infections) and on the incubation period (delay between infection and symptom onset) can be included in the model. These delays are typically modelled as Gamma distributions, which need to be discretised in order to account for the fact that time is reported as days.
Here, we will use the serial interval estimated from the transmission pairs as a proxy for the generation time. For the incubation period, your colleagues from PHM report that they observed average incubation times of 10.6 days with a standard deviation of 7.4 days. To build a discretized Gamma distribution from this data, we will:
use epitrix’s function
gamma_mucv2shapescaleto convert these parameters into shape and scale for a Gamma distributionthen use
distcreteto generate discretised distributions using shape and scale
library(epitrix)
# find parameters for Gamma
incub_mu <- 10.6
incub_sd <- 7.4
incub_cv <- incub_sd / incub_mu
incub_params <- gamma_mucv2shapescale(incub_mu, incub_cv)
# build discretized gamma
incub <- distcrete::distcrete("gamma",
shape = incub_params$shape,
scale = incub_params$scale,
w = 0.5, interval = 1)
incub
## A discrete distribution
## name: gamma
## parameters:
## shape: 2.05186267348429
## scale: 5.16603773584906
# check resulting distribution
tibble(days = 0:30, p = incub$d(0:30)) %>%
ggplot(aes(x = days, y = p)) +
geom_col() +
theme_bw() +
labs(title = "Incubation period distribution",
x = "Days from infection to onset of symptoms",
y = "Probability")Similarly, we can visualise the serial interval distribution previously estimated from the transmission chains:
tibble(days = 0:50, p = si$distribution$d(0:50)) %>%
ggplot(aes(x = days, y = p)) +
geom_col() +
theme_bw() +
labs(title = "Serial interval distribution",
x = "Days between onsets of primary and secondary cases",
y = "Probability")Using the original outbreaker model
The original outbreaker model combined temporal information (here, dates of onset) with sequence data to infer who infected whom. Here, we use outbreaker2 to apply this model to the data.
All inputs to the new outbreaker function are prepared using dedicated functions, which make a number of checks on provided inputs and define defaults:
library(outbreaker2)
data <- outbreaker_data(
dates = linelist$date_of_onset, # dates of onset
dna = unname(dna), # WGS; remove labels for compatibility
w_dens =si$distribution$d(1:100), # generation time distribution
f_dens = incub$d(1:100) # incubation period distribution
)We also create a configuration, which determines different aspects of the analysis, including which parameters need to be estimated, initial values of parameters, the length of the MCMC, etc.:
config <- create_config(
move_kappa = FALSE, # don't look for missing cases
move_pi = FALSE, # don't estimate reporting
init_pi = 1, # set reporting to 1
find_import = FALSE, # don't look for additional imported cases
init_tree = "star" # star-like tree as starting point
)We can now run the analysis. This should take a couple of minutes on modern laptops. Note the use of set.seed(0) to have identical results across different users and computers:
set.seed(0)
res_basic <- outbreaker(data = data, config = config)
res_basic
##
##
## ///// outbreaker results ///
##
## class: outbreaker_chains data.frame
## dimensions 201 rows, 158 columns
## ancestries not shown: alpha_1 - alpha_50
## infection dates not shown: t_inf_1 - t_inf_50
## intermediate generations not shown: kappa_1 - kappa_50
##
## /// head //
## step post like prior mu pi eps lambda
## 1 1 -1959.8477 -1962.1502 2.302485 1.000000e-04 1 0.5 0.05
## 2 50 -814.8591 -817.1616 2.302530 5.512792e-05 1 0.5 0.05
## 3 100 -818.4481 -820.7506 2.302543 4.201736e-05 1 0.5 0.05
##
## ...
## /// tail //
## step post like prior mu pi eps lambda
## 199 9900 -806.4059 -808.7085 2.302530 5.487824e-05 1 0.5 0.05
## 200 9950 -805.8182 -808.1208 2.302544 4.148942e-05 1 0.5 0.05
## 201 10000 -807.9692 -810.2718 2.302535 4.962910e-05 1 0.5 0.05
plot(res_basic)Exercise:
The first two plots show the trace of the log-posterior densities (with, and without burnin). Look at other possible plots documented in ?plot.outbreaker_chains, and reproduce the graphs below to show results on:
- ancestries
- infection dates
- mutation rate
- transmission tree
As a further help for interpretation, you can derive a consensus tree from the posterior samples of trees using summary. Look in particular at the support column, and compare the results to the contact data.
smry_basic <- summary(res_basic)
head(smry_basic$tree)
## from to time support generations
## 1 NA 1 -10 0.2089552 NA
## 2 1 2 -1 0.4577114 1
## 3 1 3 1 0.3830846 1
## 4 1 4 -2 0.3532338 1
## 5 1 5 -4 0.5422886 1
## 6 4 6 4 0.4726368 1
tail(smry_basic$tree)
## from to time support generations
## 45 22 45 28 0.1691542 1
## 46 30 46 32 0.4278607 1
## 47 34 47 31 0.8358209 1
## 48 40 48 29 0.1890547 1
## 49 32 49 36 0.3333333 1
## 50 40 50 34 0.1741294 1
hist(smry_basic$tree$support, col = "grey", border = "white",
main = "Consensus ancestry: support", xlim = c(0,1))Exercise: How would you interpret the results? Is this what you would have expected? As a point of comparison, repeat the same analysis using temporal data only, and plot a graph of ancestries (type = "alpha"); you should obtain something along the lines of:
Question: What is the usefulness of temporal and genetic data for outbreak reconstruction? Note that this topic has been discussed before (Campbell, Strang, et al. 2018). What other data would you ideally include?
Adding contact data to the reconstruction process
outbreaker2 also implements a module for taking contact data into account when reconstructing transmission chains (Campbell et al. 2019). Contact data currently contains case labels. While epicontacts objects will soon be accepted as inputs in outbreaker2, for now we need to operate some minor transformations to define contacts using cases indices rather than labels:
ctd <- matrix(match(unlist(x$contacts), linelist$case_id), ncol = 2)
head(ctd)
## [,1] [,2]
## [1,] 6 28
## [2,] 1 15
## [3,] 30 46
## [4,] 4 23
## [5,] 4 13
## [6,] 41 43
dim(ctd)
## [1] 36 2All inputs to the outbreaker function are prepared using dedicated functions, which make a number of checks on provided inputs and define defaults:
data <- outbreaker_data(
dates = linelist$date_of_onset, # dates of onset
dna = unname(dna), # dna sequences
ctd = ctd, # contact data
w_dens =si$distribution$d(1:100), # generation time distribution
f_dens = incub$d(1:100) # incubation period distribution
)We are now ready to run the analysis. This may take a couple of minutes, depending on your computer:
set.seed(0)
res_full <- outbreaker(data = data, config = config)
res_full
##
##
## ///// outbreaker results ///
##
## class: outbreaker_chains data.frame
## dimensions 201 rows, 158 columns
## ancestries not shown: alpha_1 - alpha_50
## infection dates not shown: t_inf_1 - t_inf_50
## intermediate generations not shown: kappa_1 - kappa_50
##
## /// head //
## step post like prior mu pi eps lambda
## 1 1 -2133.4925 -2135.7950 2.302485 1.000000e-04 1 0.5000000 0.050000000
## 2 50 -879.8242 -882.1267 2.302529 5.583594e-05 1 0.5830505 0.009765158
## 3 100 -842.5666 -844.8691 2.302535 5.002914e-05 1 0.8447734 0.002261452
##
## ...
## /// tail //
## step post like prior mu pi eps lambda
## 199 9900 -840.6293 -842.9318 2.302527 5.781234e-05 1 0.7711070 0.001722643
## 200 9950 -836.3669 -838.6694 2.302537 4.855279e-05 1 0.6657835 0.001082511
## 201 10000 -842.4907 -844.7932 2.302524 6.097349e-05 1 0.5960041 0.002275266Produce graphics as in the previous model. Assess convergence, choose an appropriate burnin, visualise ancestries and the infection timelines:
Question: How would you interpret the results? Derive a consensus tree using summary, and make a new epicontacts object, using the previous linelist, to visualise the consensus tree with meta-information:
smry_full <- summary(res_full)
head(smry_full$tree)
## from to time support generations
## 1 NA 1 -8 NA NA
## 2 1 2 1 0.6517413 1
## 3 1 3 1 1.0000000 1
## 4 1 4 -1 0.6616915 1
## 5 1 5 -1 1.0000000 1
## 6 4 6 5 0.6716418 1
smry_full$tree$support <- round(smry_full$tree$support, 2)
linelist$id <- 1:nrow(linelist) # add case index to linelist
cons_tree <- make_epicontacts(
linelist,
smry_full$tree[-1, ],
id = "id",
from = 1,
to = 2,
directed = TRUE)In the following, we add age class information to the linelist of cons_tree, and create color palettes which will be used to display information on the final graph:
support_pal <- colorRampPalette(
c("#918D98", "#645877", "#423359", "#281449", "#1A0340")
)
age_pal <- colorRampPalette(
c("#3288BD", "#ABDDA4", "#FDAE61", "#D53E4F")
)
cons_tree$linelist$age_class <- cut(
cons_tree$linelist$age,
breaks = c(0, 10, 20, 30, 40, 90),
labels = c("0-10", "11-20", "21-30", "31-40", "41+" ))Looking carefully at the documentation of vis_epicontacts, try to reproduce the final consensus tree below:
Question: What are your conclusions? What are the main drivers of this outbreak? What recommendations would you make to further improve the response?
References
Campbell, Finlay, Anne Cori, Neil Ferguson, and Thibaut Jombart. 2019. “Bayesian Inference of Transmission Chains Using Timing of Symptoms, Pathogen Genomes and Contact Data.” PLoS Comput. Biol. 15 (3): e1006930.
Campbell, Finlay, Xavier Didelot, Rich Fitzjohn, Neil Ferguson, Anne Cori, and Thibaut Jombart. 2018. “Outbreaker2: A Modular Platform for Outbreak Reconstruction.” BMC Bioinformatics 19 (Suppl 11): 363.
Campbell, Finlay, Camilla Strang, Neil Ferguson, Anne Cori, and Thibaut Jombart. 2018. “When Are Pathogen Genome Sequences Informative of Transmission Events?” PLoS Pathog. 14 (2): e1006885.
Cori, Anne, Neil M Ferguson, Christophe Fraser, and Simon Cauchemez. 2013. “A New Framework and Software to Estimate Time-Varying Reproduction Numbers During Epidemics.” Am. J. Epidemiol. 178 (9): 1505–12.
Jombart, Thibaut, Anne Cori, Xavier Didelot, Simon Cauchemez, Christophe Fraser, and Neil Ferguson. 2014. “Bayesian Reconstruction of Disease Outbreaks by Combining Epidemiologic and Genomic Data.” PLoS Comput. Biol. 10 (1): e1003457.